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In this paper, we study the role of surface of the globule and the role of interactions with the 
solvent for designed sequence heteropolymers using random energy model (REM). We investigate 
the ground state energy and surface monomer composition distribution. By comparing the freezing 
transition in random and designed sequence heteropolymers, we discuss the effects of design. Based 
on our results, we are able to show under which conditions solvation effect improves the quality 
of sequence design. Finally, we study sequence space entropy and discuss the number of available 
sequences as a function of imposed requirements for the design quality. 



I. INTRODUCTION 

The simplest concept taught to students about protein 
structure is that hydrophobic monomers are mostly in- 
side water-soluble globules, while hydrophilic monomers 
are mostly on the surface. This beautiful idea was around 
for over half a century [l|, |2|] , everyone agrees that it rep- 
resents a cornerstone for our understanding of proteins 
[|| - and yet it is somehow neglected in the most so- 
phisticated theories of heteropolymers with quenched se- 
quences. Here, we have in mind the train of works which 
started from pioneering contributions by Bryngclson and 
Wolynes [J] and by Shakhnovich and Gutin [5j]. The in- 
sight of the former authors [1] was to recognize the deep 
connection between protein field and that of spin glasses 
and to apply the Random Energy Model developed by 
Derrida j6j; the contribution of the latter [5[ was to ac- 
tually derive the REM approximation for a consistent 
microscopic model of a heteropolymer with independent 
random interactions. By now, it is understood that REM 
is a well controlled mean field approximation for the large 
compact heteropolymer [7[ ■ The important part of het- 
eropolymer theory was also the idea of sequence design, 
which was used both to better model proteins and to test 
heteropolymer properties in general [8[. 

What is important to emphasize is that heteropoly- 
mer freezing and sequence design theories operate within 
the so-called volume approximation, neglecting surface 
terms in energy. Our goal in the present work is to in- 
vestigate, for the simplest tractable model, the interplay 
heteropolymer freezing and sequence design with prefer- 
ential solvation of some monomer species on the surface 
of the globule. In fact, even for random sequences pref- 
erential solvation was not included in REM-based het- 
eropolymer theory until very recently [9(. Our work is 
ideologically a sequel to the paper [9(, and we will use 
the ideas of that work. 

In the recent series of works [H| , sequence design 
was discussed (under a different name of "coloring" ) in a 
slightly different prospective, with an eye on chemically 
preparing protein-like copolymers. The solvation effect 
was given a very prominent role in these works. One of 
the goals of our work is to make a closer link between var- 



ious implementations of the sequence design paradigm. 

The paper is organized as follows. First we will in- 
troduce the solvation model (section III Al and Fig. [1} . 
Then we will talk about sequence design technique and 
how it is affected by the solvation (section III Bp . Surface 
monomer composition distribution is obtained in section 
IIII Al Our major results are summarized in the phase 
diagram of the system, sketched in the Fig. [3J Finally, 
we discuss in section [V] the availability of sequences as a 
function of their quality characterized by the energy gap 
between their ground state and the majority of other 
states. 



II. THE MODEL 
A. Energy: bulk and surface terms 

In our model, each monomer is assigned a quenched 
random variable a, which represents its monomer type. 
For the random sequence, we assume that a for each 
monomer is drawn from some probability distribution 
p (ct) . For simplicity, we restrict p (a) to have zero aver- 
age J up (a) da — and unit variance J a 2 p(a)da = 1. 
There are 20 possible values of a for natural proteins 
since the number of amino acids is 20. Theoretically it is 
convenient to consider a continuous distribution of a or 
a discrete distribution of just 2 monomer types. 

In our model, energy of the system consists of con- 
tributions from direct contact between monomers and 
of the contribution of contacts between monomers and 
solvent. The former, contact energy, has a "homopoly- 
meric" strong average attraction part B independent on 
monomer type, and a "heteropolymeric" contribution 
—SBoiGj with amplitude SB. We assume that B is suffi- 
ciently large, such that the globule is quite dense, and the 
contacts with the solvent take place only on the surface 
of the globule. We mostly look at the case SB > 0, such 
that similar monomers attract each other. We assume 
that each contact with the solvent provides energy — To",;. 
Thus, since T > 0, monomers with a > are hydrophilic, 
while those with a < are hydrophobic. Thus, Hamil- 
tonian of our model depends on the sequence, presented 
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FIG. 1: (Color online) Illustration of the model. Monomers 
are connected by covalent bonds, and monomer type is pre- 
sented by the shade. Surface is enriched with hydrophilic 
monomers, such that the distribution of surface monomers 
/(a) is shifted compared with the bare distribution p(a). 



by seq = {<Xj}, and conformation, specified by positions 
of all monomers conf = {r^}. The Hamiltonian reads 



N 



K 



H (seq, conf) = ^ (B - 8Ba t a 3 ) A l3 — T ^ m . (1) 



i£G 



Here G is the set of K ~ TV 2 / 3 monomers located on the 
surface and, therefore, exposed to the solvent. Ay = 
A {Yi — Yj) is contact map defined as: 

J 1 Yi and Yj are nearest neighbors . . 

I otherwise 

As we mentioned, we only consider the most compact 
conformations, and we assume there are Q contacts in 
every conformation, so Ay = Q. For compact glob- 
ule, Q ~ N. For simplicity, we just use N to denote the 
number of contacts in a conformation. 

Apart from the surface term, this model is equivalent 
to the one considered earlier, e.g., [l2j]. A cartoon of the 
model is sketched in Fig. [TJ 

Polymer chain with Hamiltonian |T]), with quenched 
sequence and in the statistical equilibrium in terms of 
conformation, will exhibit some preference of hydrophilic 
monomers towards the surface - as long as T > 0. 
The way to characterize this quantitatively is to address 
the statistics of a values for surface exposed monomers. 
Namely, we will be interested in distribution / (a) of sur- 
face monomers. We expect this distribution to be dif- 
ferent from the bare distribution of all monomers p {a) . 
Qualitatively, this is illustrated in the inset of Fig. [1] 

Of course, the effect of surface exposure to the solvent 
depends on the sequence. In general, hydrophilic effect 
adds frustration to the system. Indeed, placing a certain 
monomer on the surface necessitates placing its sequence 
neighbors close to the surface, while their identity, or 



their cr-values, might imply energetic preference for the 
interior region of the globule. To address this delicate 
sequence dependence, we will look at designed sequences. 



B. Sequence design 

Random sequence can be generated by a suitable Pois- 
son process, i.e., by the probability distribution 



N 



p(0) 
scq 



(3) 



By sequence design, we want to bias the sequence prob- 
abilities in a controlled fashion. This can be done in the 
following way. 

The sequence design procedure starts from the choice 
of the target conformation which we will denote *. In 
our consideration here, * might be any compact confor- 
mation, in other words, we ignore the difference of des- 
ignability for possible target conformations [l3[ - simply 
because designability and surface exposure are two inde- 
pendent effects and we do not want to further complicate 
our work by accounting for designability. There is no 
doubt that designability along with surface effects must 
be incorporated into the complete theory. Given a tar- 
get conformation *, we will consider a statistical Gibbs 
ensemble in which conformation {r^} = * is quenched, 
while the sequence {cr^} is annealed and comes to ther- 
modynamic equilibrium at the design temperature Td, 
which is not necessarily equal to the real temperature 
T. More specifically, we use the canonical sequence de- 
sign scheme [HI, in the sense that it is generated by the 
canonical ensemble of annealed sequences. This results 
in the following probability distribution of sequences: 



scq 



p(0) 
scq 



exp[-ff d (seq,*) /T d ] 
£ seq ' exp[-ff« (seq',*) /TV 



(4) 



where the denominator ensures normalization. Of course, 
this is just the scheme, the real features of the ensemble 
of designed sequences are controlled by the design Hamil- 
tonian, H d (seq, *). It might be the same Hamiltonian as 
([1]), but this is not at all necessary. Moreover, it is useful 
to explore the more general situation in which H d ^ H . 
We will use design Hamiltonian of the same functional 
form as ([T]), but with the different parameters 8B d , and 
T d (B d , although formally included for symmetry, does 
not play any role in sequence design, because this term 
in energy is sequence-independent): 



N 



K 



H d (seq,*) = J2( Bd - SB d ^i) ^ - T d °i ■ ( 5 ) 



i<3 



From this point of view, Td is just a parameter which con- 
trols the quality of design: when Td — * oo, ensemble of de- 
signed sequences is essentially the same as random, while 
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at lower Td designed sequences are statistically strongly 
biased by optimization the design energy. 

We work on artificial model proteins design. In vivo 
De Novo design of globular protein has long been suc- 
cessfully carried out (see, for example [14[ for a four-helix 
protein design from first-principle) . The off- lattice model 
sequence design works ((l5l.[l6j) on a 60-residue SH3 pro- 
tein domain showed that the folding quality of designed 
sequences vary for this small protein. 

C. High Td expansion of the sequence-averaged 
target state energy and Random Energy Model 

First let us look at the energy of target conformation 
*, averaged over the ensemble of designed sequences: 

(£*(seq))=^P* q tf(seq,*) . (6) 

scq 

As in we can evaluate this to the lowest order in 
high Td expansion: 

(£*(seq)) = (H) + ±r{{H d H)-{H d ){H)) 

= NB - — (N6BSB d + KTT d ) . (7) 

d 

We see that sequence design, on average, leads to the 
lowering of the target state energy. The only novelty, 
albeit quite trivial, compared at the volume approxima- 
tion [12], is that both contact energy and surface energy 
terms contribute to the target energy decrease. 

A'priori, one could think that the design lowers not 
only the target state energy, but also energies of some 
other states - particularly, those similar to the target 
state. It is well known, however, that because of the ge- 
ometry of compact conformations there are not many suf- 
ficiently similar compact conformations and, therefore, 
the statistics of energies of other conformations, to a good 
approximation, remains unaffected by the design (see [ItJ 
for further more detailed discussion of this point). This 
approximation is equivalent to REM. We will work within 
this REM approximation. 

III. MONOMER DISTRIBUTIONS INSIDE AND 
ON THE SURFACE 

A. Design by solvation 

As stated above, our major goal is to address surface 
effects in terms of the distribution of a values among 
surface monomers. With the designed sequences, wc can 
first ask - what is the distribution of surface monomers 
p*{cr) just in the design state, when conformation is 
frozen? This is of course fundamentally simple ques- 
tion, because the statistical mechanics of sequences at 
quenched conformation is not frustrated [17] . Basically, 



what one has to do is to take probability distribution of 
sequences and to integrate out all spin variables {e^} 
except surface monomers, i.e., all i except those belong- 
ing to the target conformation surface set G*. 

As a warm-up, let us perform this procedure for the 
simple yet important special case of design Hamiltonian, 
in which we set SB d = 0. In this case, design affects 
only surface monomers. The probability distribution of 
sequences is tremendously simplified, it can be factorized 
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= n pfa) n p>o . ( g ) 

j$G* ieG* 

which means that the ensemble of monomers in the bulk 
of the globule remains unaffected, distributed as p(a), 
while every surface monomer, independently of others, is 
distributed as 

p*(a) = cp(a)exp(T d a/T d ) , (9) 

where c is the normalization factor, c = 
1/ Jp (a 1 ) exp (T d a' /Td) da'. Thus, for all monomers, 
including N — K monomers that were in the bulk during 
the design process, and K monomers that were on the 
surface, the overall distribution of a reads 

(N-K)p(a)+Kp*(a) 

Ptot(O-) = — 

= P(*) + §\P*(*)-P(cr)] ■ (10) 

This result, Eq. ([9]), indicates that even this simpli- 
fied design procedure, with 5B d = 0, favors the hy- 
drophilic monomers on the surface, because for hy- 
drophilic monomers with a > 0, we have p* (a) > p(cr). 
Compared with the bare monomer distribution p (a) , hy- 
drophilic monomers have larger probability to appear on 
the surface of target conformation. 

The simplified 5B d = design scheme is reminiscent 
of the method used in [l(| [ll[ , in which all the surface 
monomers of target conformation (G* in our notation) 
are made hydrophilic while all the monomers inside the 
globule are hydrophobic. In our more general consider- 
ation, it is just more probable but not necessary for the 
surface exposed monomers to become hydrophilic during 
the design. The model of the works corresponds to 
the Td — * limit of our theory. 

B. Surface monomer distribution in the ground 
state 

Let us continue examination of the simplified design 
scheme, with SB d = 0, when only surface energy biases 
the choice of sequences (design by solvation). Our goal 
now is to find the surface energy correction terms of the 
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ground state energy and, as the major step in this direc- 
tion, we need to consider the surface monomer distribu- 
tion f(a) in the ground state. One should realize that in 
the ground state, the set of surface exposed monomers 
may or may not be similar to the set of monomers ex- 
posed to the surface during the design; in other words, 
/(<t) might be similar to or might be quite dif- 

ferent from it. Therefore, there are two contributions to 
the surface energy, one due to the monomers exposed to 
the surface, and the other due to the fact that selection 
of surface monomers affects the monomer composition 
left inside the globule. To express this quantitatively, we 
write for the arbitrary state the equation similar to (|10[) 



Ptot(o-) =Pi n (a-) + — [/(cr) -p in (<r)] , (11) 

where Pi n (c) is the distribution of monomers left inside. 
Comparing equation (fTO)) and (fTTj) , we find 



K 

N 



b» -/(*)] 



(12) 



As everywhere, we neg lcct here the terms O ((K/N) 2 ). 
Thus, we directly see already here how the deviation of 
the state from the target state comes into play. 

To compute f(a) for the ground state, we adapt for 
designed sequences the procedure which was developed 
in the work [9( for random sequence solvation. To make 
our work self-contained, we briefly outline major steps. 

We begin by constructing a separate REM, called sub- 
REM, for each possible choice of surface monomers, G. 
Indeed, for each G there are still many conformations 
available. The number of such conformations is naturally 
written in the form M G = e Ns ~ KuJ G^ w here s is con- 
formational entropy per monomer in volume approxima- 
tion, and lo g is entropy loss due to confinement of some 
monomers on the surface. Although M G is much smaller 
than the total number of conformations, M = e sN , but 
the entropy loss caused by fixation of G monomers on the 
surface is only a surface effect O(K). Following [§], we 
adopt a bold approximation that lo g —lu is independent 
on G; in this approximation, counting all states shows 
that uJ= In (Ne/K). 

For each sub-REM, energies of all M G states are ran- 
dom in the sense that they depend on random sequence 
realization, and it is reasonable to assume [9J that these 
energies are independent Gaussian variables, because 
each energy, according to formula |TJ), has of order N mu- 
tually statistically independent bulk contributions and 
of order K independent surface contributions. To write 
down the resulting Gaussian distribution, we should de- 
termine corresponding mean and variance. The mean is 
found by averaging the bulk terms (£? — SBaiaj) over the 
distribution p m (a) plus averaging the surface terms — rcr.; 
over the distribution /(cr), and the variance is similarly 
found by averaging the second moment. This results fi- 
nally in the following Gaussian distribution of random 



energy: 

w G {E) oc exp 
where 



[E-(NB-KT 1G )\ 
2NSB 2 + 2K5B 2 (3 G 



(13) 



la = J <yf (c) da , 

fa = 2 / a 2 \p* (a) - f (a)} da . (14) 



Notice that dependence on the surface monomer group 
G is only through the surface monomer distribution, and 
that the dependence on design is due to the p*{a). 

Every sub-REM has a certain ground state energy 
E g (G) — E g {f(a)}, which is just the lowest of Mq ran- 
dom energies drawn independently from the distribution 
(fl~3|) . Energy E g {f(a)} is still a random variable, its 
probability distribution can be found from the so-called 
extreme value statistics [l8[ (see also Appendix [A)) : 



W G {E) 
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i exp 




— exp 


[Tb. 





(15) 



where T b = SB/V^s and 5E g = E - Efv{f{a)} is the 
deviation of ground state energy from its most probable 
(typical) value, which includes both volume and surface 
contributions: 



E g yp {f(<7)} = N(B- 2sT h ) 

+ K(UT h .-T lG + sT ir p G ) 



(16) 



Notice that the only dependence of probability distribu- 
tion Wg on the surface monomers G is hidden in j G and 
(3 G inside the most probable energy E t g yp {f(a)}, and the 
dependence on design is also there inside [3 G (see formula 
(fT4| . We also mention that Tf r = SB j\[2s appearing here 
as a parameter of the ground state distribution happens 
to have its physical meaning - it is volume approximated 
freezing temperature of the random sequence polymer 0] . 

The probability to get ground state energy anywhere 
below its typical most probable value (|16p is exponen- 
tially small. However, we try exponentially many times 
- namely, we have to choose the lowest among e KuJ = 
(Ne/K) K sub-REM ground states. Therefore, we have 
a good chance to find some particular sub-group G with 
energy noticeably below typical value (|T6|) . Essentially, 
what we have to do now is to resort second time to the 
extreme value statistics and find the expectation value 
of the lowest among the sub-REM ground states. It is 
convenient to perform this operation in a slightly differ- 
ent, but equivalent form. Namely, we note that the low 
energy tail of the ground state probability distribution, 
(fT5")) . is exponential, and, therefore, it looks effectively 
like Boltzmann distribution, with Tf r playing the role of 
temperature. 

It is useful to note here that treating the tail of the 
distribution (1151) as effective Boltzmann distribution with 
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temperature Tf r is reminiscent and essentially equivalent 
to the consideration given in the book [3[ and explain- 
ing the origin of phenomenologically discovered quasi- 
Boltzmann distribution over the ensemble of evolution- 
ary selected proteins. 

Returning to our argument, finding lowest among the 
EfP of the sub-REMs is equivalent to minimizing the 
effective "free energy" , in which the effective entropy is 
given by the number of ways to choose K monomers with 
distribution /(cr) from N monomers with distribution 
Ptot (cr): 

*{f(<r)} = -[ Ptot(cr) foM+(l-0)ln(l-$]dt7 . 

(17) 

where <f>(a) = K j (a) / N p tot {a) has the meaning of the 
fraction of monomers with type cr that are exposed to the 
surface. Including this effective entropy, we have now the 
effective "free energy" 



Ef P {f(v)} ~ T h .Ns{f(a)} . 



(18) 



We minimize this with respect to /(cr), subject to nor- 
malization condition J f(a)do = 1 and obtain 



N ptot ((7) 

K 1 + AeM"-) ' 



(19) 



where rj[ T (a) = 2sa 2 — (r/Tf r ) cr, and A is the Lagrange 
multiplier which has to be determined from the normal- 
ization condition J f(a)da — 1. Comparing this with 
the paper Q , we see that the only role of design in this 
case is the modification of monomer distribution: instead 
of bare distribution p{a) , we have now the modified one 
Ptot (c) • Let us see what are the consequence of this re- 
placement. 

Let us concentrate on the regime without the depletion 
effect, when </>(cr) < 1 at all values of a. This means 
that for any monomer type, only a small fraction of it is 
solvated to the surface region. Under such assumption, 
/(cr) can be approximated as 
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-Hr(ff) 



K 
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cexp 



T d 
— < 



p{&) 



(20) 

where we dropped for simplicity the cr-independent nor- 
malization factor. To gain some insight, let us look at 
/(cr) for a couple of simple examples of bare monomer 
probability distributions. Since real distribution involves 
a large number (20) of monomer species, we examine two 
limits of two monomer species and of continuous Gaus- 
sian distribution. 



1. Example: bimodal distribution 

In the simplest black-and-white model [lj| two types of 
monomers, one hydrophilic and one hydrophobic, appear 
with same probability: 

p(a) = l[S(a+l)+S(a-l)} . (21) 



Simple calculation shows that 
/(cr) cx 5 (cr + 1) e- r/Tt ' 
S( a -l)e r ^ 



K , fr d 

1 + ^tanhf- 



+ 



(22) 



We see that there are two effects bringing hydrophilic 
monomers to the surface, that is, increasing /(+1) on 
the expense of decreasing /(— 1). First effect is due to T 
and is measured by the ratio r/Tfr. This effect is present 
in random heteropolymer, has nothing to do with design, 
and is simply energetic: since it is more favorable for the 
cr > monomers to be on the surface, so the surface 
gets enriched with such monomers. The second effect is 
entirely due to design and it is governed by the design 
parameters T d /Td- This effect is washed way at large 
design temperature and it saturates at small Td- Notice 
that this design effect is only a surface effect, its maximal 
possible role is proportional to K/N. This is because the 
best one can do with this type of design is to shift the 
monomeric composition by the amount about K/N. 



2. Example: Gaussian distribution 



The opposite limit is presented by 



p(a) 



1 / cr 

exp 



2tt 



(23) 



The corresponding surface monomer distribution is the 
following: 



/O) oc 




- 1 



x exp 



(24) 



In Fig. [5J we made a plot of an example of /(cr) as a 
function of cr for the Gaussian case. For comparison, 
random sequence solvation {T d — + oo) is also included. It 
can be seen that design enriches surface with hydrophilic 
monomers such that the distribution is shifted toward 
hydrophilic region compared to no design case, which by 
itself is already a shift relative to the bare distribution 
p(cr). 



C. Sequence design by both solvation and 
monomer contacts: mean field approximation 

In the preceding section, we considered sequences de- 
signed by the effect of preferential solvation of certain 
monomer types under the chain preparation conditions. 
This corresponds to having only the second term in the 
design Hamiltonian , or having 5B d = 0. By contrast, 
sequence design by monomer-monomer contacts, i.e. the 
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FIG. 2: (Color online). A comparison of original distribution 
and surface monomer distribution with design, together with 
the case of without design. Design favors the monomers with 
hydrophilic type. 



limit of T d = and 8B d ^ was considered in the liter- 
ature before (see, e.g., review article [l7[ and references 
therein) . 

In this section, we consider the general case, when both 
volume and surface terms of the design Hamiltonian ([5]) 
are present. To make the argument, we resort to the 
mean-field approximation for the design system. That 
means, we consider design by an effective field which 
couples to the variables a and acts differently on sur- 
face and bulk monomers. Since design Hamiltonian ([5]) is 
quadratic in er, the said "design field" is proportional to a 
- an average value of a defined self-consistently. It is then 
easy to realize that this field vanishes in the bulk, because 
in our model design does not affect overall composition 
of the chain, and, therefore, efbuik = 0. Therefore, the 
mean field approximated design Hamiltonian reads 



H, 



mean field 



(SB d za SUTf + T d ) V a 



K 

E 



(25) 



Here, <7 sur f is the average a of surface monomers (that is, 
of monomers which happen to be on the surface during 
the design process) and z is the coordination number (the 
number of neighbors) for surface monomers. 

Within the mean field approximation, the probability 
distribution of designed sequences ([4]) gets factorized into 
independent distributions of all monomers, just like in 
the 8B d — case. Accordingly, we obtain the distribution 
of surface monomers, similar to Eq. ©: 



p*(a) cx p(o) exp 



SB d za sur{ + T" 



(26) 



where we dropped for brevity the cr-independent normal- 
ization factor. Now, the value of ef sur f must be deter- 



^surf 



ap*(a)da 



J ap(a) exp 



SB" 



da 



Jp(cr)exp 



SB d za allrf +r d 



da 



(27) 



To gain an insight into the properties of the latter equa- 
tion, it is useful to consider the examples of bimodal 
and Gaussian distributions for p{a). We do that a few 
lines below, in section llll D[ but here we notice that once 
(7 sur f is determined, the rest of the analysis follows au- 
tomatically along the lines of our previous consideration 
in the section IIII B[ Indeed, all we needed to know to 
implement the result (|19p is the overall monomer dis- 
tribution ptot(c), which is known as soon as p*{a) is 
determined (see Eq. (jTUJ) ) . Therefore, we can directly 
use our results Eqs (I20|) . (|22|) . (I24|) with the replacement 
T d -> T d + SB d za surf = T' d . 

With that in mind, let us return briefly to the deter- 
mination Of <7 sur f . 



D. Implementing the self-consistency condition 

1. Example: bimodal distribution 



For bimodal p(a), Eq. (|27|) becomes 

~5B d za smi + Y d 



oWf = tanh 



(28) 



At T d = 0, this equation has non-trivial non-vanishing 
solutions only if SB d z/Td > 1, in which case there are 
automatically two solutions of the opposite sign. That 
means, the non-zero CT sm f in this case results only from 
the spontaneous symmetry breaking, because without 
T d the system has no preference for hydrophobic or hy- 
drophilic monomers dominating the surface. The non- 
zero T d > breaks this symmetry and yields always one 
and only one positive solution for <7 SU rf (and possibly two 
negative ones which we ignore because they have higher 
free energy). 

In this bimodal case, we have a SU rf < 1, which means 
that in the replacement T d — > T' d , the solvation term T d 
dominates if T d > SB d z. 



2. Example: Gaussian distribution 

For Gaussian p(a), Eq. ([27]) is easily explicitly re- 
solved: 



& surf 



T d /T d 



1 - zSB d /T d 



(29) 



This is usually very close to T d /Td, because (see next 
section HV| in most interesting regime close to the triple 



point of the phase diagram, the denominator is domi- 
nated by the unity. 



IV. FREE ENERGY AND PHASE DIAGRAM 
OF DESIGNED POLYMERS 

A. Preliminary remarks 

In this section, we will consider the possible phases 
of the heteropolymer whose sequence is designed as dis- 
cussed above. Similar problem in volume approximation 
is well known in the literature (see, e.g., review 17[ and 
references therein). Specifically, we will consider three 



phases and the transitions between them. We will sum- 
marize the relations between phases in terms of the phase 
diagram, Fig. [31 in variables Td and T, which describe, 
respectively, the ensemble of sequences and the ensemble 
of conformations for any given sequence. The relevant 
phases in the diagram are named, respectively, liquid-like 
globule, glassy globule, and folded globule. We remind to 
the reader that liquid-like globule is the state where great 
many conformations contribute to the partition function; 
glassy globule is dominated by one or a few conforma- 
tions, but those unrelated to the target conformation *; 
and folded globule is dominated by the target conforma- 
tion *. It is fairly obvious and illustrated in Fig. [3J that 
surface solvation effects do not change the topology of the 
phase diagram, but does affect the specific positions and 
shape of the corresponding phase transition lines; these 
surface-driven changes are the subject of our interest in 
this section. 

The temperatures of the transitions from liquid- like to 
glass-like and to folded globules are called glass temper- 
ature T g and folding temperature Tf, respectively. Our 
goal is to analyze the role of surface solvation effects and 
design on both T g and Tf. In other words, we want to 
calculate how surface corrections to Tf and T g depend 
on the design temperature Td- 

As regards the third phase transition line, that between 
folded and glassy globule phases, this line must be ver- 
tical in phase diagram. Indeed, both folded and glassy 
globules are zero entropy states, the transition between 
them cannot be driven by temperature change. On the 
phase diagram, like in Fig. [3j the corresponding phase 
boundary must be represented by the line parallel to the 
temperature axis. This argument does not rely on the 
volume approximation, and, therefore, remains valid in- 
dependently of the surface solvation effects. Therefore, 
this line of phase transition is entirely described by the 
value of design temperature at which there is the triple 
point, we call it T} 3 . We want to calculate also the sur- 
face contribution to this quantity. 

The way we approach the phase diagram is based on 
the idea that for any frozen globule phase, whether glassy 
or folded, the free energy coincides with energy, because 
entropy vanishes given the number of contributing states 
of order unity. On the other hand, the free energy of 



T. 



Liquid-like globule 



T for Gaussian 




FIG. 3: (Color online) Phase diagram of the heteropolymer 
system. There are three phases: random liquid-like globule, 
frozen glassy globule and folded globule. Surface term shifts 
the phase diagram for volume approximation. For bimodal 
distribution, glass temperature is design independent, while in 
Gaussian distribution, glass temperature increases for better 
design condition. 



the liquid-like globule we can find due to the property of 
REM that quenched averaged free energy is equal to the 
annealed average above the glass temperature. There- 
fore, what we shall do is to compute the annealed average 
free energy and to find temperature of entropy "catastro- 
phe" - at which entropy vanishes; that is the glass tem- 
perature. Of course, our goal is to address surface terms 
and design terms in this procedure. Following this pro- 
gram, we write the annealed average as a functional of 
the surface monomer distribution f(a) as 



F{f} = -Tln^, 



-E/T 



conf 



Tin / e sN ~ K ^ +Ns ^^w G {E)e~ E / T dE 



- F + F sulf {f} , (30) 

where 

F - 5B 2 

— = B- sT 

N 2T 

is the bulk contribution to the free energy, and 



(31) 



FsuAf} SB 2 f 2 N , 
^ = uT - — lap [a)da + —T j dap tot [a) 



1 



+ ln(l - 0) 



(32) 



is the surface contribution to free energy. In Eq. (|32|) . 



( \ 5B2 - 
V\v) = T^ro - —a , 



rp2 



T 



p*(a) 



P (<r) exp [^-a 
J p{u) exp ( Tjr-cr) da 



(33) 
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In deriving the free energy, we have used the saddle point 
approximation since above glass temperature, free energy 
is dominated by the saddle point of the partition func- 
tion. 

Optimizing F surf {/} yields 



1 



Kf*(a) 



(34) 



l + AenW N Ptot (a) ' 
and then optimal value of T sur f {/} evaluates to 

KT f gtotW d 

J 1+AeiC") aa 

(35) 

These results are similar to those obtained in the work 
@, except we have now non-random sequences, as man- 
ifested by the dependence on T&. In that work [9j, solva- 
tion effect in random sequences was treated as the re- 
sponse of the globule to the surface solvation "field" . 
The linear response regime is characterized by the sta- 
tistical independence of disordered parts of surface and 
volume energies. Saturation regime is characterized by 
the depletion effect, when preferential solvation of a cer- 
tain monomer species exhausts these monomers from the 
globule. For completeness, there is also a narrow range 
of the so-called weak response regime. Neither weak re- 
sponse nor saturation regimes are present for the black- 
and-white polymer model, with bimodal distribution of 
monomer types. 



B. No depletion: Glass temperature 

Let us first consider the case when solvation doesn't 
cause the depletion of any monomer type, that is, cj)(cr) <C 
1 at every a, then 

F smf {f*} ™ 2 



KT 



SB f f 
'Y? / ^V^dcr-ln / ptot(c) 



^ a) da . 

First we will discuss how the glass temperature T g is 
affected by the surface energy, relative to the volume ap- 
proximated value Tf r = SBj\[2s. Glass temperature must 

be determined from by the condition — dF g^ = 0. 



Since — ||r 



0, we can write (denoting temperature 



derivatives by prime sign) AT g 
Then 



T„ — Te, 



PL* 
f" 



Tfr 



~T^~ ~ N 2sT h J p^e-wA^da 
a p {a)d<7 — 



j p(a)e- r K'(< T )da 
-^ln J p(a)e-"^d(j , 



(37) 



where we have replaced ptot(c) with p(a) because AT g 
itself is already of order 0(K/N), and we neglect any 
higher order corrections. Design effect is present here 
through p*(cr). 

Relative to no solvation case, the order 0(K/N) cor- 
rection to glass temperature is positive; in other words, 
glass temperature is increased due to the solvation effect, 
so the surface effect makes ground state more stable. 

In the following part, we will further simplify and dis- 
cuss AT g using the examples of p(cr) ■ 



1. Example: bimodal distribution 

For bimodal p(a), J a 2 p*(a)da — 1, and design effect 
doesn't show up in glass temperature, 



AT„ = 










— tanh 




— In cosh [ — | 






\TfrJ. 



when <C 1 
when ^- > 1 



(38) 



When the solvation strength T is small, r/Tf r <C 1, 
this corresponds to the statistical independent region, in 
which surface term and volume term in a heteropolymer 
globule are roughly statistically independent. In this re- 
gion, ATg cx T 2 . 

The region of T/Tf r 3> 1 is saturation region. When 
solvation strength T is so large, essentially all the sur- 
face monomers are of hydrophilic type, as can be clearly 
seen from surface monomer distribution Eq. (|22[) : when 
r/Tfr 3> 1, surface monomer distribution function is 
dominated by hydrophilic term f(+l). In this regime, 
ATg becomes independent of T, because T is already so 
large that all surface places are occupied by hydrophilic 
monomers and further increase of T cannot change any- 
thing. 



2. Example: Gaussian distribution 

For Gaussian p(a), we can write J a 2 p*(a)da = 
(T' d /T d f + 1 ~ (T d /T d ) 2 (l + 2zSB d /T d ) + 1 ~ 
(T d /Td) 2 + 1, where the asymptotic form comes from 
the fact that Td/(zSB d ) ^> 1 since we work in the regime 



of T d > T d ' ' = dB d /V2s, where Tp"> is the triple point 
in volume approximation. Then we have 



K 

AT„ ~ — T fr 

9 N ir 



Td) 



6s 



25B 2 



(39) 



where we also used the fact that sCl. As in the work 
@, system with Gaussian distributed <j, unlike bimodal 
one, has the weak response regime at very small T, when 
F/Tfr <C s; in this regime, surface solvation is insignifi- 
cant. The region of r/Tf r ^> \/2As is the regime where 
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volume and surface disorder are statistically independent, 
and the result in this region, in terms of dependence on 
r, is similar to that of the bimodal distribution. 

Of course, the major difference from the bimodal ex- 
ample is that in Gaussian case, there is an additional 
term due to sequence design. That term increases glass 
temperature, which means design, as usually, makes for 
a more stable ground state. 



C. Triple point 



Now let us consider the folded region, and begin with 

(3) 

the surface-corrected triple point T\ , we want to see 



its change ATf ) = T£> - T d °'"> relative to T £ 
5B d J ' \p2s~ determined in volume approximation. In gen- 
eral, triple point is determined from the condition that 
glass temperature equals folding temperature, Tf = T g . 
Glass temperature, along with its surface corrections, is 
already known to us, see formula (|3T[) or its simplified ver- 
sions (f38l) and (f39|) . The folding temperature Tf should 
be calculated from (E^(seq)) = F{f*}. We take aver- 
aged ground state energy from formula ([7]) and we take 
F{f*} from Eqs. dSTJ) and in the latter (which is 

the surface part) we must neglect all order 0(K/N) cor- 
rections. The result reads 



-.(3) 



(3,0) 



,(3,0) 



AT 



(3) 



N d 



(3,0) 



SB8B d 



-J a 2 p*{a)dcr 



— hi J p(cr)e-" fc ( ff )dcr 



(40) 



.(3) 



From this formula, it is not even clear whether AT^' is 
positive or negative. As with other cumbersome results, 
let us look at the specific examples of p{cr). 



Example: bimodal distribution 



For bimodal distribution p{a) , we have 



ATj 3) ~ — T 

d N 


(3,0) 


rr d 


i 


SB5B d 


( K T (3,0) r 
1 N ± d SB 


( pd 

. 1W 


2SB J 


) K T (3,0) r 
\N ± d SB 


) pd 


~ 73s ) 



- — In cosh ( 



when t£- 

-Ifr 



< 1 



when jC- > 1 



("3) 

We see that the design effect increases AT d °>, pushes 
triple point to the right on the phase diagram Fig. 
[31 while the solvation effect acts in the opposite di- 
rection. Interestingly, in the statistical independence 
regime, when r/Tf r <C 1, the sign of AT d 3 ^ is determined 
by the competition of the design term T d / 5B d and folding 
term T/5B. Specifically, large design solvation strength 
T d /5B d > T/SB would make ATf' > and in this sense 
the design makes folded state more stable. From Eq. 



(41) 



(I38)) . we already know that glass temperature is indepen- 
dent of r in saturation region r/Tf r ^S> 1. Not surpris- 

(3) 

ingly, in this region, the sign of AT^ is also independent 
ofT. 



2. Example: Gaussian distribution 



When p(a) is Gaussian, we get 



AT. 



(3) 



K 

T 

N d 

-2s 



(3,0) 



SB d 



TT d 

5BSB d 

2 

(1- 



r 2 



2SB 2 
2zV2s) 



2s 



(42) 



The interesting and rather unexpected result is that the 
design effect, when it is very strong, might lead to reduc- 

(3) 

tion of T d , in other words, it might have an adverse ef- 
fect on the stability of the folded phase. Inspection of the 
origin of the negative term oc — (Td/SB d ) 2 shows that its 
origin is due to the fact that very strong solvation effect 
in design brings in a significant fraction of very strongly 
solvophilic monomers; even though only small fraction 
of them subsequently turns out inside the globule in the 
folded state, they nevertheless make the destabilizing ef- 
fect on the globule. We emphasize that such danger exists 
only when solvation effect in design is so strong that not 
only T d /SB d > T/SB, but sT d /SB d > T/SB. It is unclear 
if such situation is realistic. 



D. Folding temperature 

Next let us consider the folding temperature Tf away 

(3) 

but not far from the triple point. When Td < T) , we 
have 



5BSB d + %rr d = 



SB 2 
~2T 



K 



surf 



(43) 



T=T f 



In the vicinity of the triple point, when Td = T c 
AT d , where AT d « f Tf\ we have T surf | T=T/ 



(3) 



IT=T„,Tj=T. 



0,0), or 



5BSB d AT d SB 2 SB 2 



r (3) 
l d 



,(3) 



2T a 



which yields upon some algebra 



T f ~T„ 



\ 



5BSB d 



T d 



rp(3)rp \ 7^(3) 



(44) 



(45) 



In Fig. [3l the phase diagram of the system is sketched. 
In the regime of temperature T below Tf, the designed 
sequences will thermodynamically stable when folded to 
the target state. In the regime of T d > Tj 3 , sequences 
obtained will be either in frozen glassy state or in random 
liquid-like globule. 
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E. Depletion effect 

In preceding sections, we assumed no depletion ef- 
fect. Here we will see what happens if there is depletion. 
Depletion of monomers may happen for Gaussian p(cr), 
while in bimodal case, there is no depletion since the 
number of monomers for each monomer type is abun- 
dant. When depletion occurs, <fi — 1 for a > er m , and 
= for a < <r m , and this leads to 



p tot (a)da = K/N 



(46) 



The integration result is 



erfc 



V2 



<Hc, 7t)- 2 



— eric — = 
K \V2 



>o 

For the case of no design, T' d = 0, 

2K 



erfc MS = 

\V2J N 



(47) 
(48) 



Therefore, we have erfc (a m /^/2) < erfc (<7^/v2), and 
this gives a m > cr^, so with design, the surface monomers 
are more hydrophilic than without design. This makes 
physical sense and this is also consistent with no deple- 
tion case <j) <C 1, in which design favors the hydrophilic 
monomers on the surface. 



neglect this fact. This is not because designability is 
unimportant - it is very important indeed, but our goal 
is to look at the role of sequence solvation effect, so to 
make this task manageable, we have to sacrifice the des- 
ignability issue as a zeroth approximation. 

To find sequence space entropy, we consider se- 
quence space free energy — Td In Z , where Z = 
Ssoq ex P [—H d (seq, *)/7d] . Note that design is to a cer- 
tain conformation *, so in the partition function here, 
the summation runs over sequences. Sequence space en- 
tropy per monomer s soq can then be found using high 
Td expansion just in the same way as the calculation of 
(E ir (seq)}, formula ||7J). The result is given by 



d(-TdlnZ) 
8T d 



Inq 



SB d 



2?1 



(49) 



where q is the effective number of 'letters in the alphabet' 
determined from the total number of possible sequences 
of length N: Af scq — q . It is not difficult to show that 
q = — ^2 a p{cr) lnp(<r) (see Eq. d3j; q w 18 for real pro- 
teins) . 

The number of sequences is maximal if we impose no 
constraints on the quality of design, which means se- 
quence entropy has to be maximal when Td is at the 
triple point. Therefore, we can compute s™q X using for- 



mula 



at Td = T 



In q 



(3) 



The result reads: 



2AT 



(3) 



T 



(3,0) 



KT d 

N6B d2 



(50) 



V. SEQUENCE SPACE ENTROPY 

Sequence design, when it is realized computationally, 
or if it could be realized experimentally, helps finding se- 
quences with particularly low ground state energy. But of 
course there is a limit - there is always a sequence whose 
ground state energy is the lowest among all sequences, 
and, therefore, no design can possibly produce any se- 
quence with lower energy. More generally and more prac- 
tically, the lower ground state energy we want to obtain, 
the fewer sequences exist which can meet our demand. 
One may want to know how many sequences are there to 
choose from with any given ground state energy. Design 
paradigm provides the general method to solve such prob- 
lem. Indeed, we can compute the sequence space entropy 
(which is just the logarithm of the number of relevant se- 
quences) as a function of Td and as we also know the 
average ground state energy as a function of Td, we can 
determine the number of sequences as depends on their 
ground state energy. This procedure in volume approx- 
imation is described in the work [jjj- Here we want to 
consider how it is affected by the surface solvation effect. 

In principle, sequence space entropy depends quite 
strongly on the target state fold here denoted as *, this 
dependence is called designability of the fold (see, for ex- 
ample, recent work on this subject [13]). Here, we will 



where the ratio AX^ jT d ' should be taken from Eq. 
(|4"0")) or from the simplified versions of it (|4"Tj) or (|42[) . 

First, in the volume approximation, when there is no 
surface term, we have s™q X = \nq — s. This result is a 
very natural consequence of our neglect of the difference 
in designabilities between different folds. Indeed, volume 
approximation of Sgcq* indicates that the number of se- 
quences that can be designed for a given conformation * is 
e JVs seq _ N Beq /N con{i which means that all Af seq = q N se- 
quences are equally distributed between A4onf = e sN con- 
formations. This is because the fraction of sequences with 
ground state energy above (E+(seq)) ([7]) is extremely 
small, see appendix I A 11 so practically all sequences have 
their ground state energy around (E+(seq)). 

Second, we look at the role of surface effect. For 
simplicity, we restrict consideration to the most typical 
regime of statistical independence between surface and 
bulk contributions. For both bimodal distribution and 
Gaussian distribution, plugging AT^ 3 ' into Eq. (|5fl|) . we 
get the following simple result: 



sK 

= In o — s 

q N 



r 

SB 



5B d 



(51) 



This tells us that the surface solvation effect reduces the 
number of sequences, s™q X < bxq— s. This happens be- 
cause some of the sequences, with inadequate supply of 
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hydrophilic monomers, have their ground state energies 
above (E+(seq)) when we look at them carefully enough 
to notice their surface energy. Accordingly, the fraction 
of sequences with ground state at * is below its 'fair' 
share of e^ lnq ~ s,>N and even maximal sequence space en- 
tropy falls short of its volume approximated value In q— s. 
Only very careful design, at which T d /5B d = T /SB, would 
be able to provide the ensemble of sequences adequate 
to their solvation conditions, in which case the solvation 
effect does not increase energy and does not rule any 
sequences out of the competition. Notice that the con- 
dition T d /SB d = T/5B does not involve design tempera- 
ture, it specifies only the balance of solvation and bulk 
hetcropolymeric effects. 

Let us now look at the situation differently, namely, let 
us write down the folding temperature Tf in terms of s soq 
instead of T^. Indeed, Td is a purely technical concept 
which may or may not directly correspond to the exper- 
imental reality; for instance, design can be controlled by 
some analog of solvent quality instead of temperature. At 
the same time, sequence entropy is a very clear quantity, 
it is the number of sequences whose ground state stabil- 
ity corresponds to the temperature Tf. Simple algebra 
shows that 



Example: bimodal distribution 



T f = T n 



5BSB d 



(3) 



\ sT g T 



1 - 



lh\q 



J scq 



In q 



(52) 



This allows to re-interpret phase diagram, Fig. [3l with 
sequence entropy on the horizontal axis. 

Finally, it is known 0] that the quality of design is best 
characterized by the energy gap between the energy of the 
sequence in its purported target state and the average 
ground state energy 



Ae = 



nn\ Tg - (£*(seg)) 

N 



(53) 



Therefore, we should look at the relation between se- 
quence entropy and Ae. From the above results, we have 
found 



^scq 



= In q — s 1 



where the solvation related coefficients are given by 



SB d ' 



5B5B d 



and 



c 



J a 2 p*{cr)da + ^-\n J p{a)e- r ' iAa) da 



(55) 



(56) 



There are less sequences available for larger energy gap 
design. 



When p{a) is bimodal, 



i r 

C = — In cosh — - 
s 2s T h 



2. Example: Gaussian distribution 
When p{a) is Gaussian, 



T 2 / T d 



(57) 



(58) 



In either case, we see that the number of available se- 
quences drops dramatically as we increase their desired 
quality by choosing a larger Ae. 



VI. CONCLUSION 

In this paper, we examined the interplay of surface 
solvation effects and sequence design for protein-like het- 
cropolymer globule. Ideologically, our treatment of dis- 
ordered sequences followed the theoretical studies of het- 
eropolymer folding in the works 0, [U, 0, U3] , and in our 
treatment of preferential solvation we used the approach 
of the work . What we did is we applied REM in the 
new challenging context. 

As in the volume approximation, designed sequences 
in the target conformation have lower energy than ran- 
dom sequences. This is not surprising: this is after all 
the sole purpose of design. Less obvious, we found that 
the role of preferential solvation for the design itself might 
be controversial. The problem is that when design condi- 
tions favor too strongly the hydrophilicity of the surface 
monomers, these monomers can have an adverse effect on 
the overall composition of the sequence and then disrupt 
the favorable arrangement of contacts inside the globule. 

Speaking about phase diagram of the heteropolymer 
globule, we found that surface solvation effect operates 
differently for the two most typical examples of monomer 
composition. If there are only two types of monomers, 
then glass transition temperature remains independent 
of the design condition, as it was found in volume ap- 
proximation. But this is not longer the case when there 
is a wide Gaussian distribution of monomer types; in 
this case, design brings in a noticeable fraction of very 
hydrophilic monomers from the tail of hydrophilicity dis- 
tribution, and they do affect the glass transition. 

To conclude, our study shows it possible to incorpo- 
rate preferential solvation effects into the the REM-based 
heteropolymer theory, and some of the obtained results 
are quite delicate and unexpected. In reality, the role 
of surface in molecules of realistic sizes is quite signifi- 
cant, so the effects which were examined here on pertur- 
bative level, considering surface contributions (D(K/N) 
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very small, might be quite substantial and very impor- 
tant. 
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APPENDIX A: PROBABILITY DISTRIBUTIONS 
OF THE LOW ENERGY STATES IN REM 

To make this work self-contained, we review here the 
probability distributions of the low lying states in the 
REM. Further details on this subject can be found in 

M. 



1. Ground state energy 

Consider some M. (M = e sN ) statistically indepen- 
dent energy levels, each distributed with probability den- 
sity w(E). The question is this: what is the probability 
distribution, W(E) of the lowest among these M. levels? 
The general answer, due to the statistical independence, 
reads 

W(E)=Mw(E)(J w(E')dE') . (Al) 

Here, w(E) is the probability that there is a level at E, 

(Jb° w(E')dE') M 1 is the probability that all other M — 
1 levels happen to be above E, and factor M. reflects the 
fact that any of the M. levels may play the role of the 
lowest one. 

With a large number A4 ^> 1 of levels taken from 
the distribution w(E), the lowest one of them will surely 
be located somewhere in the low energy tail of the dis- 
tribution w(E). In this region, J™ w(E')dE' is very 
close to unity, or in other words w(E')dE' = 1 — 



J_ oo w(E J )dE' ~ exp 
also 1 compared at M in M 
422): 



- f_ w(E')dE' . Neglecting 



1, we then re- write eq. 



W(E) ~ Mw(E)e- M f-°° w{E ' )dE ' . 



(A2) 



Where is the maximum of W(E), what is the most 
probable ground state energy, E m l The condition 
W'(E) = yields 



Mw(E rn ) = {lnw(E m )Y 



(A3) 



Apart from the logarithmic corrections, this returns the 
familiar condition traditionally written in a careless form 
Mw(E) ~ 1 in which units do not match. 



We now remember that w(E) is Gaussian, 
1 



-E A I2N&B A 



V2irNSB 2 
In this case formula (|A3|) reads 

Mw(E m ) = -E m /NSB 2 , 
which then implies 



(A4) 



(A5) 



\ 



2NSB 2 (lnM-\n 

Ns 



— E 
NSB 2 



V2ttNSB 2 



and by simple iteration we finally have 

SB 



E„ 



2sNSB 



2V2s 
2sN6B + (In N) 



In (4ttNs) 



(A6) 



(A7) 



The leading term here corresponds to dropping the pre- 
cxponential factor of w(E) when writing Mw{E) ~ 1, 
in which case, of course, units match and the result is 
correct. 

It is fairly obvious, and will be verified instantly, that 
W(E) is concentrated around E m . Accordingly, let us 
assume E = E m + e, with |e| <C \E m \. In this small e 
range, we can simplify the Gaussian w(E): 



Mw{E) 



M 



-(E m +e) 2 /2NSB 2 



\/2ttN5B 2 

V2s/5B 



(A8) 



where Tf r = -^=. We further use the Gaussian shape of 
w{E) and the proper asymptotics of the error function 
to write 



M I w(E')dE' ~ Mw(E) 

—E 



e/T b 



(A9) 



Plugging both flXgj) and (TX9]) into eq. (fM)) . we arrive at 
the following probability distribution of the ground state 
energy 

W(E) ~ (l/T fl .)e ie/T! ' ) ' cxpi ' i/Tf ' ) , (A10) 
where, once again, 

SB 



e = (E - E m ) , T fr 



'2s 



(All) 



As expected, W(E) is not symmetric, it decays much 
faster to the right (higher energies) than to the left (lower 
energies). However, the characteristic scale in both cases 
is set by the quantity Tfr = SB j\[2s and does not contain 
N in any form. It is only the parameter-free "shape" 
e^~ e<i that is asymmetric. It is also satisfying that W{E) 
Eq. (|A10[) is correctly normalized. 
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Lowest and second lowest energy states: joint 
distribution 



Consider now the joint probability distribution of the 
two lowest energies. Exact formula for this joint distri- 
bution, similar to Eq. (|A1[) reads 



W(E U E 2 ) = M(M-l)w(E 1 )w(E 2 ) 

/ poo \M-2 

x w(E')dE'J . (A12) 

Similar to Eq. (|A1|) . we rely here on statistical inde- 
pendence of energy levels in REM, which implies that 
we should consider the independent factors of one energy 
level present at E\ (which gives w(Ei)), another energy 
level present at E 2 (yielding w(E 2 )), times the probabil- 
ity that all other M. — 2 energy levels are above E 2 , and 
times the combinatorial factor M. (A4 — 1) which reflects 
the idea that any two of the M. levels can play the role 
of the lowest and second lowest. 

The first step to simplify this distribution is to notice 
that both lowest E\ and second lowest E 2 energy levels 
are practically always located far in the lower energy tail 
of the distribution w(E). Similar to Eq. (|A2|) . we then 
write 



W{Ex,E 2 ) = M 2 w{E 1 )w(E 2 ) 

x exp -M / w(E')dE' 



(A13) 



The next step is to rely on equations (|A8|) and (IA9|) . 
Denoting 

(A14) 



Ei = E m + ei , and E 2 = E m + e 2 , 
and assuming | ei,2 1 <C \E m \, we arrive at 

W(Ei,E 2 ) ~ J- e ( e i/ T fr) + ( £ 2/7f r )-exp( e2 /T fr ) 



(A15) 



3. Some derivative probability distributions 

It is a wonderful rewarding exercise to establish 
that integrating out the first excited state E 2 re- 
turns the ground state distribution (|A10|) . as it must: 
J™W(E 1 ,E 2 )dE 2 =W(E 1 ). 



a. Probability distribution of the second lowest state 



-4 



P(s) 




0.4 




second 
lowest 






lowest /O.Gy 








/ o.i 





-2 



-1 



FIG. 4: Probability distributions for the lowest and second 
lowest energy states in REM. Horizontal axes is £ = e/Tf r , 
and vertical axes is the probability density for this quantity £ 
(so the area under each curve in the figure is unity). 



This distribution is different from that of the lowest en- 
ergy state, Eq. (|A10p . as it is clear from the Figure 
|U It is hardly a surprise that the second lowest energy 
distribution is somewhat shifted towards larger energies 
compared at the distribution of the ground state. 



b. Conditional 'probability of E2 at given E\ 



Suppose ground state energy is fixed at E\ , what is the 
probability distribution of E 2 at the given E\l According 
to the general rules of probabilities, this quantity is equal 
to 



W(£ 2 |£d) 



W{E U E 2 ) 
W{Ex) 



3 (e2/T fr )-cxp(e 2 /T fr )+oxp(ei/T fr 



\A17) 



For understanding, it is useful to mention that this quan- 
tity is normalized by the condition W(E 2 \ E\)dE 2 = 
1. 



For the second lowest energy, we have by simple inte- 
gration 



W 2 (E) = f W(E 1 ,E)dE 1 

J —OO 



^(e/TfO-exp^/Tf,.) 



(A16) 



Probability distribution for the gap between lowest and 
second lowest states 



As regards the gap between two lowest energy states, 
AE = E 2 — Ei , its probability distribution is also easily 
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found: 



W(AE) 

/OO /*oo 
dEi / dE 2 W(E u E 2 )5 (E 2 -E x - AE) 
-BO J Ex 

/do 
W(£ 2 - AE,E 2 )dE 2 
-OO 

e 2 - A£ 



' — OO 

1 



exp 



fr 



,-AE/T b 



1 



de2 



Ce-^C = — e - A£ /^ . (A18) 



Of course, this is the result for AE > 0, so in general 
probability distribution for the gap reads 



W(AE) = 







■AE/Ti r 



for A£ < 
for A£ > 



(A19) 



It might seem surprising at the first glance that this prob- 
ability decays only as exponential, not the sharper e~ cxp * 
function. Indeed, if we imagine gap forming as fixing low- 
est energy at the typical place and then looking at the 
second lowest level, then the probability of the gap de- 
cays in a very sharp, double exponential way. However, 
if we think differently, that E 2 is fixed in a typical place, 
and then gap is determined by where is the E\ , then the 
probability of E\ going down is just exponential. Thus, 
from this argument we gain the following insight. The 
probability of the gap is exponential because the domi- 
nant possibility for the gap to be large is having unusually 
low E\ rather than high E 2 . Thus, large gap is typically 
the result of the low ground state. 



4. Averages 



where £ = e/Tf r and r\ = e^. As a fact of mathematical 
curiosity, there appears V, the derivative of the Euler T- 
function - rather infrequent guest in physics calculations. 
However exotic mathematically, this term is smaller than 
the logarithmic correction term in (|A7|) and so it must 
be neglected if the first approximation is used for E m . 



We can also compute the average value of the second 
lowest energy: 



(E2) = 



EW 2 (E)dE 



E n 



SB 



rj In r\e ,] dq . (A21) 







r'(2)=i+r'(i)«o.4 



This is greater than the average ground state energy 
(|A20jl by just the amount T it = SB/V^s- 



The latter result is also consistent with the fact that 
the average gap, according to eq. (|A19|) is equal to 



Using formula (|A10|1 . it is easy to compute the expec- 
tation value of the ground state energy: 



(E g ) = / EW(E)dE 

J —OO 

/OO 
-OO 

= E m + -== / hxTje'^dr] , (A20) 
V2s Jo 



(AE) = T fr = SB/Vis 



(A22) 



r'n)R!-n.fi 
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